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ABSTRACT 

This  paper  describes  the  implementation  of  three  standard  matrix 
eigenvalue  computation  methods  on  an  array  machine  with  high  efficiency. 
A  brief  description  of  the  ILLIAC  IV  computer  is  provided  as  background 
material.   Three  major  sections  follow- -the  first  two  describe  Jacobi  and 
Householder  algorithms  for  real  symmetric  matrices,  and  the  third  describes 
the  ^R  algorithm  for  real  nonsymmetric  matrices.   Each  of  these  sections  is 
divided  into  four  parts.   The  theoretical  background  of  the  method  is  pre- 
sented first.   An  ILLIAC  IV  implementation  of  the  algorithm  is  presented  and 
timing  estimates  are  included.   Next,  the  efficiency  (ratio  of  number  of 
computations  actually  performed  to  number  of  computations  possible  in  a 
given  time)  of  the  computation  on  an  array  machine  is  derived  for  primary 
memory  contained  matrices.   Finally,  eigenvalue  computations  for  matrices  too 
large  to  be  contained  in  primary  memory  are  discussed  in  terms  of  a  head-per- 
track  secondary  storage  device. 

It  is  shown  that  for  Jacobi  and  Householder  algorithms,  parallel 
computation  efficiencies  of  c)Q°j0   are  possible,  while  those  of  the  Q,R  algorithm 
are  rather  low. 
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1.   INTRODUCTION 

The  implementations  on  a  parallel  computer  of  three  of  the  most 
commonly  used  techniques  for  finding  the  eigenvalues  and  eigenvectors  of 
matrices  are  discussed.   These  are  Jacobi's,  Householder's  and  the  QR  method. 
Jacohi's  algorithm  for  real  symmetric  matrices  was  modified  in  order  to  take 
advantage  of  parallelism  in  such  a  way  that  instead  of  two,  only  one  ortho- 
gonal transformation  would  eliminate  n  off-diagonal  elements,  where  n  is  the 
size  of  the  matrix.   Storage  schemes  for  all  three  methods  on  the  parallel 
computer  are  discussed  in  detail.   It  is  well  known  that  Jacobi's  algorithm 
proved  to  be  highly  efficient  for  parallel  computation;  yet,  its  use  is 
recommended  primarily  when  evaluation  of  all  the  eigenvalues  is  required. 
Even  then,  if  high  accuracy  is  also  desired,  it  should  be  used  in  conjunction 
with  a  scheme  for  establishing  machine  bounds  for  the  eigensystem. 

Householder's  algorithm  for  tridiagonalization  of  a  real  symmetric 
matrix  has  also  proven  to  be  very  efficient  on  a  parallel  computer;  a  parallel 
modification  of  the  method  of  bisection  can  be  used  on  the  resulting  tri- 
diagonal  matrix  to  obtain  one  or  two  eigenvalues  or  to  investigate  the  distri- 
bution of  all  of  them.   If,  however,  all  the  eigenvalues  are  required,  the  QR 
method  is  usually  preferred.   The  QR  algorithm  is  discussed  here  only  for  non- 
symmetric  matrices.   Implementation  of  this  method  on  the  parallel  machine 
proved  to  be  less  efficient  than  the  other  two,  though  it  is  the  most  reliable 
for  obtaining  the  eigenvalues  of  any  matrix.   For  Householder's  or  the  QR 
algorithm,  the  eigenvectors  of  the  tridiagonal  or  the  upper  Hessenberg  mat- 
rices may  be  obtained  by  the  method  of  inverse  iteration.   If  the  transfor- 
mation matrices  used  in  reducing  the  matrix  to  the  condensed  form  are  stored, 
the  eigenvectors  of  the  original  matrix  can  be  readily  computed. 
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2.   ILLIAC  IV  OVERVIEW 

2. 1  System  Organization 

ILLIAC  IV  is  an  array  of  coupled  computers  driven  by  instructions 
from  a  common  control  unit.    Each  of  the  processing  elements  (PE's)  has  20^-8 
words  of  64-bit  memory  with  a  cycle  time  under  350  nanoseconds.   Each  PE  is 
capable  of  64-bit  floating  point  multiplication  in  about  500  nanoseconds  and 
addition  in  about  350  nanoseconds.   Two  32-bit  floating  point  operations  may 
be  performed  in  each  PE  in  approximately  the  same  time.   The  PE  instruction 
set  is  similar  to  that  of  conventiqnal  machines,  with  two  exceptions.   First, 
the  PE's  are  capable  of  communicating  data  to  four  neighboring  PE's  by  means 
of  routing  instructions .   Second,  the  PE ' s  are  able  to  set  their  own  mode 
registers  to  effectively  disable  or  enable  themselves. 

Figure  1  shows  6k   PE's,  each  having  three  arithmetic  registers 
(A,  B,  and  C)  and  one  protected  programmer  register  (s).   The  registers,  words, 
and  paths  in  Figure  1  are  all  6k   bits  wide,  except  the  PE  index  registers  (XR),; 
mode  registers,  and  as  noted.   The  mode  register  may  be  regarded  as  one  bit 
which  may  be  used  to  block  the  participation  of  its  PE  in  any  action.   The 
routing  registers  are  shown  connected  to  neighbors  at  distances  of  +  1  and  +  8; 
similar  end-around  connections  are  provided  at  1,  6k,    etc.   Programs  and  data 
are  stored  in  PE  memory.   Instructions  are  fetched  by  the  control  unit  (CU)  as 
required  from  PE  memory. 

Figure  1  also  shows  the  essential  registers  and  paths  in  the  CU  and 
their  relations  to  the  PE's.   Instructions  are  decoded  and  control  signals  are 
sent  to  the  PE  array  from  the  control  unit.   Some  instructions  are  executed 
directly  in  the  CU,  e.g.,  the  loading  of  CU  accumulator  registers  (CAR's)  with 
program  literals.   Operand  addresses  may  be  indexed  once  in  the  CU  by  a  CAR  am 
again  separately  in  each  PE  by  XR.   It  is  possible  to  load  the  local  data 
buffer  (6k   words  of  6k   bits  each)  and  CAR's  from  PE  memory.   Local  data  buffer 
registers  and  CAR's  may  be  loaded  from  each  other.   A  broadcast  instruction 
allows  the  contents  of  a  CAR  to  be  transmitted  simultaneously  to  all  PE's.   It 


A  64-PE  ILLIAC  IV  system  is  expected  to  be  operating  soon.   For  a  more  com- 
plete description  of  the  system  see  [1]. 
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is  often  convenient  to  manipulate  all  PE  mode  bits  or  a  number  from  one  PE  in 
a  CAR.   For  this  purpose,  the  broadcast  path  is  bi-directional. 

The  complete  ILLIAC  IV  system  includes  a  Burroughs  6500  which  per- 
forms input/output  operations,  compilation,  and  contains  an  operating  system 

9 
to  control  ILLIAC  IV.   There  is  a  10  bit,  head-per-track  disk  with  a  kO 

9  1 

millisecond  rotation  speed  and  an  effective  transfer  rate  of  10  bits/ second. 

This  allows  the  loading  of  the  8  x  10  bit  ILLIAC  IV  memory  from  the  disk  in 
32  milliseconds.   The  input/output  controller  (IOC)  contains  a  queuer  for 
disk  requests  and  2   bit  buffer  memory  to  smooth  transmissions  to  and 
from  the  slower  B65OO  memory,  or  an  external  input/ output  device. 

2.2  Programming  ILLIAC  IV 

It  is  apparent  that  ILLIAC  IV  may  be  suited  to  the  execution  of 
algorithms  defined  on  arrays  of  data  (e.g.,  matrix  operations  and  certain 
partial  differential  equations).   However,  there  are  two  major  difficulties 
in  using  such  a  machine.   One  is  that  an  algorithm  must  be  devised  which 
allows  identical  computations  to  be  performed  simultaneously  on  a  large  num- 
ber of  operands.   The  other,  closely  related  to  the  first,  is  that  the  data 
must  be  so  placed  in  the  ILLIAC  IV  memory  that  over  the  course  of  the  calcu- 
lation few  PE's  are  disabled.   In  other  words,  the  algorithm  and  the  data 
storage  allocation  must  be  mutually  designed  for  highly  parallel  computation. 
For  this  reason,  ILLIAC  IV  codes  are  expressed  as  two  parts,  a  storage  allo- 
cation block,  followed  by  a  computational  algorithm  block.   These  codes  may 
be  written  in  a  high  level  language.   See,  for  example,  [2]. 


-h- 


3-   JACOBI'S  METHOD 

3.1  Theoretical  Background 

The  classical  Jacob i  method  reduces  a  symmetric  matrix  to  the 
diagonal  form  by  a  series  of  orthogonal  transformations,  [3]> 
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~Y*  "Y*  "Y* 


(0) 


Each  transformation  eliminates  two  off-diagonal  elements.   It  is  possible, 
however,  to  eliminate  n  off-diagonal  elements  by  one  orthogonal  transforma- 
tion, where  n  is  the  order  of  the  matrix  A.   This  can  be  achieved  if  the 

transformation  matrices  cp   are  of  the  form, 
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Therefore,  the  matrix  A  _  will  consist  of  2  x  2  submatrices  of  the  form 
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if  the  elements  of  the  transformation  submatrix  T,  are  chosen  such  that, 
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will  be   of  the   form: 
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Therefore,  after  one  transformation,  the  off-diagonal  elements 

(  a  \     =  fa   |    ,  where  p  =  (0,  2,  ^,  .  .  .  ,  n  -  2),  and  q  =  p  +  1, 
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are  eliminated.   In  order  to  prepare  the.  matrix  for  another  transformation, 
the  zero  elements  in  the  diagonal  submatrices  are  replaced  by  carrying  out 
the  following  permutation: 

t 


r+1        r+1 


(h) 


-6- 


where  ¥ 
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,  Y¥     =  I,  [I]  being  the  identity  matrix,  and 


(1,0),  eQ  =   (0,1) 


In  effect,  this  permutation  shifts  the  second  row  to  the  bottom  and  the 

second  column  to  the  far  right.   The  matrix  A   -,  is  then  ready  for  another 

r+1 


transformation  A 


r+2 


cp  ,  A  .,  cd  ,  •   It  can  easily  be  shown  that  all  the 
^r+1  r+1  r+1 


off-diagonal  elements  will  be  exposed  to  this  process  of  elimination  after 
every  (n  -  l)  orthogonal  transformation,  if  the  first  row  and  first  column 
are  shifted  to  the  bottom  and  far  right,  respectively.   This  second  kind  of 
shifting  is  equivalent  to  the  permutation 
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To  check  for  convergence  the  sum,  E,  of  squares  of  the  off-diagonal 
elements  is  compared  to  the  sum,  D,  of  squares  of  the  diagonal  elements 
after  every  orthogonal  transformation.   If  the  ratio  r\  =   E/D  is  less  than 
an  arbitrarily  small  number  £,  then  the  diagonal  elements  of  the  matrix  A 


m 


are  assumed  to  have  converged  to  the  eigenvalues  of  the  original  matrix  A. 
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In  general,  each  diagonal  element 
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will  lie  in  an  interval  of  width 


2s/ E.   The  practically  diagonal  matrix  A  will  be  of  the  form: 
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where 
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=  (cp  -,   .  .  r  cp  n 


y  co2  y 
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is  the  matrix  whose  columns  are  the  eigenvectors. 


3- 2   Implementation 

Consider  the  implementation  of  Jac obi 's. method  on  a  16-PE  machine 
using  the  storage  allocation  scheme  of  Figure  2  for  an  upper  triangular 
16  x  16  matrix.   The  mapping  of  element  a.  .  into  PE  memory  follows,,  where 
P_  (assumed  even)  is  the  number  of  PE's  being  used  (ixl  is  the  greatest 
integer  <  x) : 
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This  storage  allocation  scheme  is  used  because  of  the  negligible  amount 
of  routing  involved  in  implementing  the  Jacobi  process.  A  step-by-step 
discussion  of  the  process  follows. 

First,  consider  the  calculation  of  the  transformation  matrices  of 
Equation  1  according  to  Equations  2  and  3.   The  transformation  matrices  may 
be  computed  in  the  pairs  of  PE's  shown  in  Figure  3.   Proper  a. .  elements  may 
be  accessed,  as  shown  in  Figure  3,    in  three  memory  cycles  plus  fewer  than  ten 
unit  routes  (distance  1  or  k)    under  mode  control.   This  entire  time  is  negli- 
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gible  with  respect  to  the  times  for  the  calculations  which  follows. 

The  computations  of  Equations  2  and  3  are  carried  out  in  the  pairs 
of  PE's  shown  in  Figure  3-   The  time  required  for  these  is  less  than  15 
microseconds. 

The  major  part  of  the  computation  time  is  consumed  in  carrying  out 
the  transformations  indicated  by  Equation  0.   This  computation  involves  a 
number  of  2  x  2  matrix  multiplications,  and  it  is  important  to  see  that  these 
may  be  performed  efficiently  using  the  storage  allocation  scheme  of  Figure  2. 
Notice  that  for  0  <  i  <  7,  a   and  a.  .  n  are,  in  blocks  of  four,  four  PE's 
apart,  and  can  be  brought  to  the  same  PE  in  one  unit  route.   Across  the  boun- 
daries of  the  blocks  of  four,  elements  are  distance  five  apart.   Similarly, 

for  0  <  i  <  6,  a   and  a.  _  .  are  five  PE's  apart  and  can  be  brought  to  the 

-   -     ij      i+l,  J 
same  PE  in  two  unit  routes.   For  8  <  i  <  15,  a. .  and  a.  .  .  are  five  PE's 

-   ~      ij      i,J+l 
apart  while  for  8  <  i  <  14,  a. .  and  a.  _  .  are  four  PE's  apart  in  blocks  of 

-   -      ij      i+l, J 
four  with  five  PE  distance  at  the  block  boundaries.   The  distances  between 


=7; 


and  an.  are  irregular  but  these  rows  do  not  interact  while  so  labeled. 
By  moving  the  computed  transformation  matrices  T  ,  .  .  .  ,  T  to 


the  control  unit,  pairs  (T  ,  T  ),  (T  ,  T  ),  etc.  can  be  broadcast  to  the 
appropriate  PE's.   Figure  4  shows  the  first  two  steps  of  this  process.   Line 
1  shows  the  subscripts  of  the  a. .  which  may  be  accessed  in  one  memory  cycle 

and  line  2  shows  the  subscripts  of  the  A. .  partitions  to  which  the  a. .  of 

1J  1J  t 

line  1  belong.   By  broadcasting  T  and  T  from  the  control  unit,  TAT, 

■£  m  0  1  0  00  0 

T_A. T  ,  and  TAT   are  computed  in  parallel.   Lines  3  and  4  show  similar 
subscripts  for  the  second  step  of  the  processs,  during  which  TnA  T„,  T_A  T  , 

-j-  .  0  Od  cL  U  Uj  j 

TAT,  and  TAT  are  computed.   Since  T  and  T  were  broadcast  in  step  1, 

they  are  still  available  in  step  2.   The  2x2  matrix  multiplications  are 

facilitated  by  spacing  the  adjacent  row  and  column  elements  one  or  two  unit 

routes  apart.   On  the  next  step  A^,  ,  A^,  An,   and  An ,_  will  be  transformed. 

04   05   14,      15 

This  processing  is  continued  by  rows  until  the  entire  matrix  has  been  trans- 
formed.  Notice  that  at  the  main  diagonal  this  procedure  suffers  from  an 
efficiency  of  approximately  50%  if  computed  in  the  straightforward  way  shown 
above;  this  point  will  be  discussed  later. 

The  total  time  to  carry  out  one  transformation  on  a  6k   x  6k   upper 
triangular  matrix  is  approximately  6  milliseconds  using  6k   PE's.   The  time 
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to  transform  a  square  6k   x  6k   matrix  is  less  than  12  milliseconds. 

Next,  the  rows  as  shown  in  Equation  k   are  permuted.   The  storage 

scheme  of  Figure  2  allows  this  to  he  done  "by  moving  relatively  few  elements 

of  the  matrix.   Figure  5  shows  the  resulting  matrix  in  which  rows  0,  1,  and 

8  and  columns  k,    8,  and  12  have  been  moved.   Each  column  is  moved  left  one, 

row  0  is  moved  right  three  unit  routes,  and  rows  1  and  8  are  moved  somewhat 

irregularly.   However,  the  total  amount  of  routing  time  involved  is  negligible 

with  respect  to  the  transformation  time.   In  general,  three  rows  andvP  -  1 

■  E 

columns  will  he  moved  in  a  similar  way. 

Figure  6  is  a  relabeled  version  of  Figure  5,    showing  the  new  names 
of  the  shuffled  elements.   Note  by  comparison  with  Figure  2  that  the  origin 
has  moved  nine  PE's  to  the  right  and  one  row  down  in  PE  memory.   This  is  a 
matter  which  is  easy  to  accommodate  in  a  program  for  carrying  out  the  Jacobi 
process.   At  this  point  the  entire  process  is  repeated,  started  by  computing 
a  new  set  of  transformations.   After  n  -  1  such  steps  the  second  shuffle  is 
performed  according  to  Equation  5- 

The  second  shuffle  requires  less  routing  than  the  first.   Row  0's 
routing  is  similar  to  the  routing  of  row  1  in  the  above  discussion,  with 
columns  K,    8,  and  12  pushed  back  distance  one  to  accommodate  the  new  column. 

3-3  Efficiency 

First,  an  efficiency  calculation  for  the  simple  process  described 

above  shall  be  performed,  operating  on  a  (kvP_  x  k -s/P  )  matrix  using  P  PE's. 

E        E  E 

k-1 

Assuming  one  time  unit  to  operate  on  a  (\Tp  x  vP  )  partition,  Z  =  k(k  -  l)/2 

E     E  , 

1=1 

time  steps  is  used  for  square  partitions  plus  k  time  steps  for  triangular  par- 
titions on  the  main  diagonal.   If  the  elements  which  are  to  be  made  zero  are 
not  computed  (just  set  to  zero),  only  —  steps  should  be  used  to  transform  the 
triangular  partitions  on  the  main  diagonal.   Thus,  there  is  an  efficiency  of 

k(k  -  l)/2  +  k/2         k   . 
k(k  -  l)/2  +  k     ~   k  +  1 

For  a  6k   x  6k   matrix  and  6k   PE  s,  (i.e.,  k  =  8  and \Tp  =  8),  there  is  an 
o  E 

efficiency  of  —  ~ 
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This  same  technique  and  storage  allocation  scheme  may  be  used  on 
matrices  of  dimensions  less  than  the  number  of  PE's.   The  discussion  holds 
of  the  transformation  of  A„  ,  An  ,  A  ,  and  A   operated  on  asfP  x  \Tp 
partition  of  a  larger  matrix.   The  process  may  be  used  repeatedly  on  a  large 

matrix  or  just  once  on  a^/P  xsll?     matrix.   Full  efficiency  can  be  achieved 

E     E 

on  square  matrices  of  dimension  k  s/ P  x  k  v  P  . 

The  maximum  efficiency  which  can  be  achieved  for  the  Jacobi  process 
on  a  parallel  machine  depends  on  how  the  diagonal  partitions  and  any  columns 

in  addition  to  kvP  are  treated.   Suppose  an  attempt  is  made  to  transform 

E 

simultaneously  A  ,  A  ,  A^,  \h>    \^>    and  \cr   °^  "t:iie  example,  and  to  sub- 
sequently repeat  this  pattern  on  the  main  diagonal.   This  procedure  will 
achieve  100$  efficiency  during  the  transformation  of  the  triangular  matrices, 
but  an  irregular  route  of  two  elements  will  be  incurred  since  a   and  a 
are  in  the  same  PE  as  are  a   and  a    . 

U<d  y  ,  11 

Matrices  of  dimension  larger  than  6k   x  6k   may  be  partitioned  into 
a  number  of  blocks  of  dimension  6k   x  6k   or  smaller.   For  example,  a  100  x  100 
matrix  becomes  a  6k   x  6k   upper  triangular  partition,  a  36  x  36  upper  trian- 
gular partition  and  a  6k   x  36  rectangular  partition.   All  rows  of  rectangular 
partitions  are  stored  according  to  the  mapping  given  earlier  for  rows  between 
0  and  Pp/2  _  !•   In  other  words,  the  lower  part  of  the  matrix  is  not  packed 
into  holes  in  the  upper  part,  but  is  stored  in  rows  just  like  the  upper  part. 
In  applying  the  Jacobi  process  to  such  a  rectangular  matrix,  proceed  as  above 

transforming  blocks  of  dimension  v  P  x\)P  and  achieving  full  efficiency  on 

E      E 

partitions  of  dimension  p  \]  P  x  q  \J  P  .   If  the  dimension  of  the  matrix  is  not 

E        E 

a  multiple  of nTp  ^  the  leftover  columns  may  be  handled  with  100$  efficiency 
if  the  transformations  are  treated  as  column  operations.   Entire  columns  are 
available  in  one  memory  cycle  using  the  storage  allocation  scheme  described 
above. 


Thus  it  seems  reasonable  to  conclude  that  efficiencies  of  90%  or 
greater  may  be  attained  for  any  size  matrix  without  inordinate  programming 
difficulties. 
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3-  k     Disk  Considerations 

2 
Large  matrices  that  do  not  fit  into  primary  memory  are 

partitioned  into  blocks  of  6^-  x  Gk.      The  number  of  rows  of  "blocks  is 


m    = 


L 


N  -  1 


+  1,  where  N  >  Gk   is  the  size  of  the  matrix.   Let  the  number 


of  any  block  b  be  given  by 


C-l  . 
b  =   R  +    Z   k 

k=0 


(C,R  =   1,  2,  ...  ,  NR) 


where  R  is  the  row  number  of  the  block  and  C  is  the  column  number  of  the 


block.   The  blocks 


R-l 
R  +    Z   k,  R  =  (1,  2,    .  .  .  ,  NR), 
k=0 


are  brought  in  from  the  disk  sequentially  to  be  subjected  to  an  orthogonal 

transformation,  Equation  0.   Thus  all  the  rotation  angles  are  determined  and 

the  rest  of  the  blocks,  C-l 

b  =  R  +   Z    k,  R  =  (0,  1,  .  .  .  ,  NR)  C  /  R 
k=0 

are  fetched  from  the  disk  to  complete  the  given  transformation  of  the 
whole  matrix.   Some  of  the  rows  and  columns  of  each  submatrix  are  stored 
in  arrays  in  primary  memory,  for  purposes  of  the  shifting  process  that 
takes  place  after  the  whole  matrix  is  subjected  to  a  transformation.   There- 
fore, the  following  elements  will  be  stored  in  primary  memory  for  shifting 
the  second  row  and  second  column: 


Submatrix 


Elements  to  be  stored 


R-l 
1.   b  =  R  +   Z  k 
k=0 


(R  >  1) 


first  row  of  the  submatrix,  i.e., 
the  elements  a. . . 

i  =  6U(R  -  1) 

3   =   6U(R  -   1),    .    .    .    ,    Gk  R   -   1 


2 


The  results  of  this  section  are  valid  for  various  secondary  storage 
devices,  e.g.,  drum,  delay  line  or  the  head  per  track  disk  which  we 
discuss. 
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C-l 
2.   b  =  R  +   Z        k 
k=0 

(R  >  1,  C  >  R) 


C-l 
3.   b  =  1  +  Z    k 
k=0 

C  >  1 


first  row  and  first  column  of  the 
submatrix,  i.e.,  the  elements 


(\i)  and  (ars)  ' 

k  =  6h(R  -  1) 

I  =  6k(c  -  1),    . 

s   =  6U(c  -  l) 

r  =  6U(R  -  1),  . 


.  ,  6U  c  -  l 

.  ,  6k   R  -  1 


second  row  and  first  column  of  each 
submatrix,  i.e.,  the  element 

(a   )  and  [a,  \  . 
pqy     \  tu/ 

p  =  1 

q  =  6U(C  -  1),  .  .  .  ,  6U  C  -  1 

u  =  6k(c  -   l) 

t  =  0,  .  .  .  ,  63 


1*.   b  =  1 


second  row  and  second  column  of  the 
submatrix,  i.e.,  the  elements 


(a   )  and  [a   )  . 
vwy      ^  xyj 

.  •  ,  63 


v  =  1 

w  =  1, 

x  =  0 

y  =  i 


For  the  first  row  and  first  column  shifting,  the  elements  to  be  stored 
in  the  core  are  the  same  as  above  except  those  of  items  (3)  and  (k) : 
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t 


Submatrix  Elements  to  be  stored 

C-l 


3  .  b  =  1  +   £   k  first  row  and  column  of  each  sub- 

matrix,  i.e.,  the  element  /a   I  , 

\  pqy 


w 


c  >  1  p  =  o 

q  =  6k(C   -  1),  .  .  .  ,  6k   C-  1 

u  =  6k(c  -   l) 

t  =  o,  .  .  .  ,  63 

k   .     b  =  1  first  row  only,  i.e.,  the  elements 


fa  ), 


v  =  0 

w  =  0,  .  .  .  ,  63 

The  total  number  of  elements  to  be  stored  in  primary  memory  in  each 
kind  of  shifting  is  N(NR  +  1). 

The  primary  memory  of  6k  PE's  is  capable  of  holding  32  partitions 
6k  x  6k'}  the  transformation  time  for  a  square  block  is  about  12  ms,  and  the 
rotational  delay  of  the  head-per-track  disk  is  kO  ms.  If  the  partitions  on 
the  disk  are  laid  out  in  such  a  way  that  four  of  them  are  accessed  at  once, 
then  kQ  ms  later  the  four  6k  x  6k  partitions  will  have  been  transformed  and 
the  disk  will  have  rotated  more  than  one  complete  revolution.  Thus,  if  the 
requests  for  four  partitions  in  the  disk  queuer  are  kept  one  rotation  ahead, 
the  rotational  latency  of  the  disk  can  be  masked  in  computing  on  large  matrices . 
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k.      HOUSEHOLDER'S  METHOD 


k. 1  Theoretical  Background 

This  method  [3?^]  reduces  a  symmetric  matrix 

A  =  [a. .]  {l,j   =   0, 1,2, . . . ,n-l)  to  the  tridiagonal  form  T  using  elementary 

-'-J 
Hermitian  orthogonal  matrices.   There  are  (n-2)  steps  in  this  reduction. 

We  define  the  matrices  A  "by  the  relations, 

r 

Ar  =  Pr  Ar_x  Pr      (r  =  1,2,..., n-2)  (6) 

where  A  =  A  is  the  original  matrix.   The  matrices  P  are  defined  by 

P  =  I  -  !£r  (7) 

2 
2K 
r 

in  which  u   is  a  row  vector  (whose  first  r  elements  are  zero)  given  by, 

u   =  (0,...,0,  a     +  S  ,  a  _    _ , ,  a  _    ,N         (8) 

r  r-l,r  -  r   r-l.r+1'     '   r-l,n-l) 

s'|l^  (9) 

2 
and,      2K  =  S   (S  +  a  .   )  (10) 

r    r   r  —  r-1,  r 

where  the  elements  a. .  are  those  of  A  n .   The  signs  in  both  expressions 

ij  r-1         &  * 

(8)  and  (10)  are  chosen  such  that, 

[a+s|=|an|+S.  (11) 

r-l,r  —     r1         '    r-l,r'  r 

Therefore,  in  the  r-th  step  (Af  =  Pf  A    P  )  [n  -  (r  +  l)]  zeros  are  in- 
troduced in  the  r-th  column  and  row  of  A    without  destroying  the  zeros 
introduced  in  the  previous  step. 

In  order  to  take  advantage  of  symmetry,  the  matrices  A  may  be 
defined  by  the  relations  [ 3] , 

-18- 


t     t  .     . 

A=An-qu-uq  (12) 

r    r-1    r  r    r  r 

where  u  is  as  given  above,  and  q  is  given  "by 

t 

%  =  pr  -  1     r  r  Pr  C^) 

2     2 

2K 
r 

in  which, 

P  =  A  _   Ur  .  (Ik) 

r    r-1  — 

2K 

r 

The  total  number  of  multiplications  required  to  reduce  the  matrix  to  the 

3 
tridiagonal  form  is  essentially  2  n  . 

3 

The  eigenvalues  of  the  tridiagonal  matrix  may  be  obtained  by 
several  methods.   For  this  parallel  computer  it  is  proposed  that  if  all  the 
eigenvalues  are  required,  then  the  QR  algorithm  with  origin  shift  [3;5>6]  is 
used,  (this  algorithm  is  discussed  for  real  nonsymmetric  matrices  in  the 
following  section).   However,  if  one  is  interested  in  the  general  distribu- 
tion of  the  eigenvalues  and  not  in  their  accurate  values,  or  the  eigenvalues 
in  a  given  interval,  the  bisection  method  [3*7]  may  1°e   used.   In  order  to 
show  how  the  method  of  bisection  would  be  used  for  a  parallel  computer, 
assume  that  it  is  necessary  to  obtain  all  the  eigenvalues  of  the  resulting 
tridiagonal  matrix  T  using  such  a  method.   For  the  sake  of  illustration, 
assume  also  that  all  the  subdiagonal  elements  are  non-zero  and  hence  the 
matrix  has  simple  eigenvalues.   The  eigenvalues  lie  in  the  interval  [-b,b], 
where  b  =    T  j    =  max  E  I  t. . I .   This  interval  is  then  divided  into 

i  loo  ,  X3 

63  subintervals  and  for  all  ju,  (k  =  l,a,  . .  .  ,6k),   where  u  =  -h,  and  n^,  =  +b, 
the  quantities, 
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W  -   (Vl,!-!  "  \>    fi-lK'  "  V2,i-1  W^H1  "  2.3,..., J 

are  evaluated  in  parallel,  one  PE  to  each  id   .   For  each  jll  the  number  of 

K.  K. 

agreements  in  sign  s(jil  )  of  consecutive  members  of  the  sequence  f  (ju,  ) , 
f  (ll),  .  ...,  f  (jll  )  is  the  number  of  eigenvalues  larger  than  jll  .   The 
process  is  repeated  for  those  intervals  [jll  ,  jll   ]  that  contain  more  than 
one  eigenvalue  until  all  the  eigenvalues  are  separated.   Assuming  that  the 
matrix  is  of  order  6k,    6U  intervals,  each  containing  an  eigenvalue  are 
established.   By  assigning  each  interval  to  a  PE  and  by  successive  bisection 
of  each  interval,  using  the  Sturm  sequence  property,  all  the  eigenvalues 
are  obtained. 

Once  all  the  eigenvalues  are  available,  the  eigenvectors  of  the 
tridiagonal  matrix  are  obtained  by  the  method  of  inverse  iteration  [3,8]. 
Again  this  can  be  done  in  parallel,  each  PE  evaluating  an  eigenvector. 
Finally  if  Z.  (i  =  l,2,...,n)  is  an  eigenvector  of  the  tridiagonal  matrix 
T,  then  the  corresponding  eigenvector  Y.  of  the  original  matrix  A  is  com- 
puted by, 


Y.  =  P,  P„  P  „  Z. .  (16) 

x    12       n-2  l 


k.2     Implementation  of  Householder  Tridiagonalization 

We  shall  consider  the  implementation  of  Householder's  method 
on  a  16-PE  machine  using  the  storage  allocation  scheme  of  Figure  7  f°r 
an  upper  triangular  16  X  16  matrix.   This  scheme  is  somewhat  less  com- 
plicated than  Figure  2  since  it  is  not  necessary  to  provide  for  shuffling.  In 
particular,  adjacent  row  and  column  elements  are  always  a  unit  route 
from  each  other,  except  between  the  seventh  and  eighth  rows.   The  lower 
half  of  the  matrix  is  stored  in  reverse  order  to  allow  efficient  com- 
putation on  this  part  of  the  matrix.   This  scheme  also  allows  access  to 
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any  row  with  one  memory  fetch.   Operations  may  be  performed  on  P„  x  P 

E    E 

blocks  at  any  step,  where  the  size  of  the  matrix  n  is  equal  to  the  number 

of  PE's   P_. 


PE    -   1, 


The  memory  map  for  this   scheme  consists   of  two  parts: 

E 

2 


a.  . 


loc   I,   PE        ((i(mod/PE))  nTpe  +  j)    (mod  PE)  1  ;  (17) 


P  P 

if     E<i<     E-l 


a.  . 


loc  PE   -   I,    PE   [~(PE   -  1)    -    ((i(mod/PE))/PE  +  j)    (mod  PE)  1 


A  step-by-step  discussion  of  the  process   follows. 

As  the   first   step   (r  =  l),    consider  row  0  of  the  matrix,    (anry    am  > 

.    .    . ,    a.     ._).      This  row  is   in  location  0,    Figure  7«      The   first  diagonal 

element  of  the  tri diagonal  matrix  T  is  t       =  a_^.      The  element  t_n    takes 

e  00    00  01 

the  value  of  +  S  determined  by  Equation  9  (with  r  =  l)  and  sign  opposite  to 
that  determined  by  Equation  11.   The  rest  of  the  elements  of  the  first  row  of 
T  are,  of  course,  zero.   The  vector  u~  is  then  readily  constructed  by  Equa- 
tions 8,  9,    and  10  in  a  time  which  is  negligible  compared  to  the  following 
calculations.   The  vectors  p..  and  q  are  then  computed  with  almost  full 
efficiency  using  Equations  13  and  lk. 

Once  u  ,  q  ,  and  the  first  row  of  the  tri-diagonal  matrix  T 

are  constructed,  matrix  A  is  obtained  using  Equation  12.   However,  this 

transformation  can  be  applied  on  the  individual  \/P  x  vP  submatrices  A^„, 

E     E  -K^ 

where  R,  C  =  (0,  1,  .  .  .  ,  nTp^  -  l),  thus  dividing  the  vectors  u  and  q 

E  J.       J. 

into  \/ P  subvectors  each  of  \l  P  components.   The  submatrices  A^  at  any 
step  r  are  given  by, 

(V)  -  <v)   ,  -  ^(E)  *(c)t  -  »(E)  ^(c)t  ^ 

IRC   r  TRC  r-1  r  r  r  r 

for  all  R  and  C   such  that  R  <  C. 
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It  may  be  noted  from  Figure  7  that  for  the  steps  r  =  (l,  2,    .    .    .    , 
\Tp  ),  the  efficiency  of  parallel  computation  on  the  submatrices  A ,    A ,    A , 
and  A ,  Figure  8,  descreases  as  r  increases,  while  that  of  the  remaining  off- 
diagonal  submatrices  A  ,  A     ,  and  A   is  100%.   Moreover,  efficiencies  of 
higher  than  50%  can  be  obtained  for  the  diagonal  submatrices  A  ,  ^pp>    and-  ^v 
The  entire  time  required  for  reducing  a  6k   x  6k   symmetric  matrix  to  the  tri- 
di agonal  form  on  a  6U-PE  machine  is  less  than  k   ms. 


k. 3  Efficiency  of  the  Tridiagonalization 

The  efficiency  of  the  Householder  process s  on  a  symmetric  P  x  P 
matrix  is  computed  using  the  storage  allocation  scheme  of  Figure  7-   Assume 
that  computations  are  performed  on-Jp  x^^v   blocks  and  that  no  attempt  is 
made  to  mask  inefficiencies  incurred  due  to  the  annihilation  of  rows  and 

columns  within  particular  v  P  x  s/P  blocks.   It  is  assumed  that  diagonal 

iii     E 

blocks  are  transformed  in  pairs  and  that  no  inefficiency  is  incurred  when 
there  is  an  even  number  of  such  blocks.   In  Figure  7  it  may  be  noted  that  for 
example,  the  upper  left  and  lower  right  triangular  blocks  may  be  accessed  and 
transformed  simultaneously,  i.e.,  the  elements  with  subscripts  (0,0;  0,1;  0,2; 
0,3;  15, 15;  1,1;  1,2;  1,3;  1^,15;  1^,1^;  2,2;  2,3;  13, 15;  13,1^5  13,13;  3,3). 
This  pairing  can  be  observed  along  the  entire  diagonal.   It  is  also  possible 
to  overlap  parts  of  square  blocks  which  have  been  annihilated.   For  example, 
-when  "J  F   /2  rows  in  a  strip  of  block  are  set  to  zero,  adjacent  blocks  could  be 
paired  and  100%  efficiency  regained.   Similarly  after  nTp^A,  nTP™/8,  .  .  .  , 
etc.  rows  are  annihilated,  full  efficiency  could  be  regained.   This  latter 
matter  requires  some  delicate  discussion  as  well  as  programming  which  will 
not  be  included  here. 

The  number  of  operations  required  to  carry  out  the  equations  of 
Householder's  process  on  a  triangular  matrix  of  decreasing  dimension  is 

a=i  i=i       d=i  5 
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The  number  of  operations  performed  by  a  parallel  maching  using 

the  storage  allocation  scheme  of  Figure  7 ,   with  the  above  assumptions, 

follows.   There  are  P„  elements  per  s  v     XvP„  block  and  each  block  is 

ill  ±L.      ±Li 

swept  \/P„  times  per  pass.   This  product  times  the  total  number  of  "v/P,-,  x^P^ 
blocks  to  be  transformed  in  all  passes  is 


\Tp  p 

IE 


/pE-l 
Z 

3=1 


Z 
i=l 


i  +  2 


Z 
i=l 


where  the  first  term  in  the  brackets  counts  the  number  of  rectangular 
blocks,  and  the  second  term  counts  the  number  of  paired  triangular  blocks 
on  the  diagonal.   This  is  equal  to 


p|  +  2p|  +  i.5P^PE 


6 


Thus  for  the  Householder  process  on  symmetric  matrices  we  have: 


processing  efficiency  = 


ops,  required 
ops.  used 


PE  +  3PE  ♦  2 


P  +  1.5sTp   .  P  +  2P 

E     ^  E    E     E 


If  n  =  P =  6k   this  yields  an  efficiency  of  approximately  86%.   With  the 
introduction  of  the  pairing  of  partially  annihilated  rectangular  blocks 
mentioned  above,  efficiencies  of  well  over  90%  can  be  achieved. 

As  in  our  discussion  of  the  Jacobi  process,  matrices  of  larger 
and  smaller  dimensions  may  be  handled  using  the  same  storage  scheme. 
Matrices  of  dimension  smaller  than  the  number  of  PE's  are  handled  using 
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vP  X^Pg  "blocks.   Those  of  dimension  larger  than  the  machine  size 

are  partitioned  into  triangular  and  rectangular  blocks.   The  rectangular 

p 
blocks  are  stored  using  the  memory  map  specified  for  rows  0  <  i  <  E  -  1 

2 
for  all  rows,  since  there  are  no  holes.   In  the  case  of  either  large  or 

small  matrices  of  dimension  other  than  a  multiple  of  Jp  the  above  effi- 
ciency formula  is  a  close  approximation  of  the  correct  value. 


k.k     Disk  Considerations 

Large  matrices  that  do  not  fit  the  primary  memory  are  parti- 
tioned into  blocks  of  6k   x  6k.      The  number  of  rows  of  blocks  is, 


KR  =      rs  +   1 


[Hr] 


where  N  >  6k   is  the  size  of  the  matrix.   The  procedure  is  the  same  as 

above.   At  the  r-th  step,  the  first  row  of  the  remaining  matrix  is  fetched 

from  the  disk  and  the  vectors  u  ,  p  ,  and  q  are  constructed.   The  6^-  x  6k 

r   r       r 

blocks  are  then  fetched  from  the  disk  one  at  a  time  and  transformed  as 
discussed  above.   The  efficiency  of  the  computation  increases  as  the  size 
of  the  matrix  gets  larger. 

The  elements  of  the  resulting  tridiagonal  matrix  are  stored  in 
the  primary  storage.   Because  of  symmetry,  the  number  of  these  elements  is 
only  (2N  -  l) .   The  time  taken  for  the  transformation  of  an  off-diagonal 
6k   x  6k  block,  Equation  18,  on  a  6U-PE  machine  is  approximately  6  ms., 
assuming  that  the  vectors  u,  p,  and  q  are  already  available.   Thus  if 
requests  for  eight  6k   x  6k   partitions  in  the  disk  queuer  are  kept  one 
rotation  ahead,  the  rotational  latency  of  the  disk  would  be  masked. 
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5-   THE  QR  ALGORITHM 

5-1  Theoretical  Background 

Throughout  this  discussion  it  is  assumed  that  the  matrix 
A  .  [a..]  (i,J  .  0,1,2,-  ...n-1)  nnder  consideration  is  Doth  real  and  non- 
singular.   The  QR  transformation  consists  of  forming  a  sequence  of  matrices 
A.  (k  =  1,2, . ..)  by  the  relation  [5], 

\+i  ■  «£ \\  '  (17) 

where  Q  is  an  orthogonal  matrix  such  that 

K. 

is  an  upper  triangular  matrix.   In  general,  for  large  k  the  matrix  JL  tends 
to  a  form  in  which  all  the  eigenvalues  are  either  isolated  on  the  diagonal 
or  are  the  eigenvalues  of  2  X  2  diagonal  submatrices. 

The  QR  algorithm  is  always  used  after  the  original  matrix  is 
reduced  to  the  upper  Hessehberg  form,  because: 

(1)  The  upper  Hessenberg  form  is  invariant  under  the  QR  trans- 
formation [  5]  • 

(2)  The  number  of  multiplications  involved  in  a  QR  transfor- 

2  3 

mat ion  is  greatly  reduced,  n  after  reduction  vs.  n  for  the  unreduced 

matrix,  where  n  is  the  order  of  the  matrix. 

(3)  The  QR  algorithm  has  strong  convergence  properties  when 
applied  to  upper  Hessenberg  matrices  [9]- 

There  are  several  stable  methods  for  reducing  any  square  matrix  to  upper 
Hessenberg  form  [3,^]«   The  method  of  Householder,  which  is  described  in 
the  previous  section,  may  be  used  for  that  purpose.   The  reduction  to 
upper  Hessenberg  form  is  therefore  completed  in  (n  -  2)  steps,  the  r-th 
of  which  is  given  by, 

Ar  =  Ar_1  -  vr  p^  -  (qr  -  ar   ur)  v^      (r  =  1,2,...,  n-2)   (19) 
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where  A  is  the  original  matrix  A  and, 


*r  =  &>  —  >   °>    \,r-l  ±  V  ar+l,r-l' >an-l,r-l}       (20> 

(21) 


P1  =  u*   A  .  (22) 
r    r  r-1 

q  =  A  _  u  (23) 

^r    r-1  r  v  ' 


v  =       Ur  (214-) 

r   — 7 

S  (S  +  a 77 
r  r  —  r,r-l; 

°r  =  Pr  Vr  '  (25) 


The  sign  of  (a    ,  +  S  )  is  chosen  as  indicated  in  Equation  11.   The  total 
r,r-l  -  rJ  u  ,- 

number  of  multiplications  involved  in  this  reducation  is  essentially  —  n  . 

The  QR  algorithm  is  invariably  used  with  origin  shifts  [3,5*6,10] 
described  by, 

<(\"  ?k  r)  "V  Vi  =  <  \\    (k.1,2,...)       (26) 

where  Q,  is  orthogonal,  R  is  upper  triangular,  f   is  the  origin  shift,  and 
A^  is  the  upper  Hessenberg  form  of  the  original  matrix  A.   In  order  to  speed 
up  convergence,  the  origin  shift  £,  is  chosen  such  that  it  ultimately  ap- 
proaches  an  eigenvalue  of  A.   It  often  happens  that  the  matrix  A  has  complex 
conjugate  eigenvalues  which  lead  to  a  complex  value  of  t,  .   Francis  [5] 
proposed  the  method  of  double  origin  shift  which  combines  two  QR  iterations 
in  one,  avoiding  complex  arithmetic.   This  method  may  be  described  by  the 
relations, 


W. 


(\  -  ^  x»  <\  -  w  «  -  \  (27) 


\+2  "  \\  \  (28) 


■27- 


where  W.  =  Q.  Q,  n  is  orthogonal,  t      and  t  are  the  origin  shifts,  and 

K     k   K+  -L  K       K+  L 

IL  =  Rn    R.  is  upper  triangular.   The  origin  shifts  (\  and  t.         are  usually 
k    k+  Ik.  -k      K+  JL 

taken  as  the  eigenvalues  of  the  last  2x2  diagonal  submatrix,  and  since  A 
is  real  they  are  either  both  real  or  complex  conjugate.   Hence  the  matrix 
B  =  (A  -  (\  I)  (A,  -  £,    i)  is  real  and  no  complex  arithmetic  is  involved. 
However,  since  A  is  nonsingular,  A  p  and  W,  are  determined  by  the  first 

>nly,  and  the 

are  positive. 


column  of  W,  only,  and  the  determination  is  unique  if  the  subdiagonal  ele- 

K. 


ments  of 


\+2 


The  first  column  of  W,  is  the  same  as  the  first 

k 


column  of  the  elementary  Hermitian  orthogonal  matrix  N  that  eliminates  the 


elements  of  the  first  column  of  the  matrix  B,  below  the  diagonal. 

k 


The  first 


column  of  B  has  only  three  non-zero  elements. 

K 


They  are  given  by, 


b(k)  _  (k)(  (k) 


e   )   +  a(k)  a(k)  +  Tl 
-00  ''~00   V  +  aoi  aio  +  ''1 


(k)  _   (k),  (k)    (k) 

10  "  aio  laoo  +  an 

(k)  _   (k)  (k) 

20  "  ai0  a21 


^ 


(29) 


where, 


ek  =^k  +  5 


(k)        (k) 
k+1  "  an-2,n-2  +  an-l,n-l 


30) 


-  t      r  -  a(k)      a(k)       a(l°      a^ 

k   ^k  ^k+i  "   n-2,n-2  n-l,n-l    n-2,n-l  n-l,n-2 


Thus  the  matrix  N  is  given  by, 


N  =  I 
o 


t 
o  o 


(3D 


!y 


o'  '2 


where, 
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yQ   =  (1,  b  ,  tQ,  0, ,  0) 

„(*) 

s  =  10 
o 


^7  So 


(32) 


.(*) 


t  =   20 
o 


^¥TT 


t 


^(k))2+  (b(k))2+  (b(k^2 

-  00  ;     l  10  ;     k  20 


J 


Therefore,  the  matrix  C.  =  N  A  N  is  of  the  form, 


X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 

X 
X 

X 
X 
X 

X 
X 
X 

where  the  underlined  elements  are  those  elements  of  A,  affected  by  the  trans- 
formation.  Now  the  matrix  C   can  he  reduced  to  the  upper  Hessenberg  from 
A,  ?  by  the  orthogonal  transformations, 


\+2" 


n-2  n-3 


N^  N*  A  N  Nn 

1   O   K   O   1 


n-3  n-2 


(33) 


where  the  elementary  Hermitian  matrix  N  is  chosen  to  eliminate  the 
elements  of  the  first  column  of  C  below  the  sub diagonal.   Therefore, 

C2  =  Nl  Cl  Nl  Wil1  be  °f  the  f°rm 
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Consequently,  N  is  chosen  to  eliminate  the  elements  of  the  second  column 
of  C  below  the  sub diagonal,  and  so  forth.   In  general,  the  orthogonal 
matrices  N  (r  =  1,2,  ...,n-2)  are  given  by, 


where, 


t 
2Y  y 

N  =  I  -  r  r 
r       


y 


3h) 


yr  =  (o, 


,o,i,sr,tr,o, ,o, 


(35) 


in  which  the  first  r  elements  are  zero, 

(r) 
c 
s  =   r+l,r-l 

r  "771 

cv  ;  _  +  S 
r,  r-1  —  r 

c(r) 

t  =  °r+2,r-l 

r   ~TrT 

cK    *        +    S 
r,  r-1  —  r 


S  = 
r 


|7cM 

I   r,r- 


1>     +  lCr+l 


,r-lA    l  r+2,r-l;  ! 
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(37) 


(38) 


After  each  iteration  the  subdiagonal  elements  are  examined  for 
possible  partitioning  into  two  or  more  smaller  upper  Hessenberg  matrices 
and  the  iterations  continued  with  the  bottom  submatrix.   Wilkinson  [3]  and 
Martin  [6]  mention  that,  using  the  double  origin  shift  method,  the  average 
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number  of  iterations  per  eigenvalue  for  most  matrices  that  are  encountered 
in  practice  is  less  than  two. 


5*2  Implementation 

Figure  8  shows  the  storage  allocation  scheme  for  a  16  x  16  non- 
symmetric  matrix  on  a  16  EE  machine.  This  storage  scheme  maps  an  element 
a.  .  into  the  memory  as  follows: 


a...  -  loc  >Tp   _i_   +  .  _j_  .,   PE(  (i(mod  Vpg))  x  >/pe  +  .  _i_  i  +  j  ) 

(mod  PE)       (39) 


where  P„  is  the  number  of  PE's  in  the  machine  and  LxJ  is  the  greatest  in- 

teger  less  than  x.   The  implementation  of  the  reduction  of  the  original 
matrix  to  the  upper-Hessenberg  form  and  the  QR-iterations  is  similar  to  that 
of  the  Householder  tridiagonalization  explained  in  Section  k.2,    and  hence 
will  not  be  discussed  here  in  detail. 

5-3  Efficiency 

As  a  direct  extension  of  the  discussions  in  Sections  k.2   and  k<3 
it  can  be  shown  that  using  Equations  19  -  25  the  reduction  to  the  upper- 
Hessenberg  of  a  6k   x  6k   matrix  would  take  less  than  10  ms  on  a  6k- PE 
machine  with  an  efficiency  higher  than  86$.   It  will  be  shown  that  the  Q,R 
transformations  dominate  the  overall  efficiency  calculations.   Detailed 
examination  of  the  efficiency  of  one  step  in  double  QR-iteration  C  _  = 
NCI  follows,  where  N  is  defined  by  Equations  3^  -  38.   By  virtue  of 


Equations  19  -  25,  C  ,  may  be  written  as, 


Cr+1  =  Cr  -  vr  p*  -  (%  -  ar  j)   v*  (k0) 
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where  pr  =  yrC.,  g^  =  Cryr,  vy  =    2    yr,   «p  =  Prvr,  and  yr  is  defined 

llyrll2 

by  Equations  35-38.   On  a  parallel  computer  the  efficiency  of  the  process 
used  for  finding  the  square  roots  affects  the  overall  efficiency  of  the 
Q,R- algorithm.   Therefore,  in  order  to  minimize  the  inefficiencies,  we 
describe  the  following  square  root  routine  [11].   Any  positive  floating 
number  Z  may  be  expressed  as  Z  =  2   (x)  where  either  -  <  x  <  1  or 
j-  <  x  <  — .   Thus  v  Z  =  2  (v  x)  and  the  problem  is  reduced  to  finding  the 
square  root  of  x.   We  use  the  recursive  relation, 


Vl  =  ^  (3  "  X  WiP     4  =  0,  1,  2,  ...  (1+1) 


where, 


2 
w  =  a  +  bx  +  ex  (1+2) 


in  which, 


a  =  3.1621U 

b  =-5.86118  for  i  <  x  <  \  (1+3) 

c  =  k. 75000 


and, 


a  =  2.22152 

b  =-2.031^6  '  for  i  <  x  <  1  (1+1+) 

c  =  0.81250 


Wi  COnVerges  <luad:cat:i-cally  "to   _1_  and  v  x  =  (x)  lim  oj,. 
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i  iterations  prove  sufficient  for  evaluating  the  square  root  of  x  to 
bits  of  accuracy.   Assuming  this  accuracy  satisfactory,  six   is  given 


sfx  =  o  px  -  3  P3x2  -  1  x2  73  (1^5) 

k  5"        2 


here, 


P  =3-.   "ix.3 
2   0   2    0 


7  -  3  3  -  1  x  p3  (U6) 

2     2 


a)  is  given  by  Equations  U2  -  UU. 

Figure  9  shows  how  the  arithemtic  operations  would  be  performed 
parallel  computer  for  finding  the  square  root, 


S   = 
r 


r  (r)   \2   /-  (r)     \2   /  (r)     n2"H 
L^^r-l)  +  K+l,   r-l>  +  (°r+2,  r-l}  J  > 


dements  of  the  vector  y  , 


(r)  (r) 

c  c 

s   =  r+1.  r-1  ,   t      -     r+2,  r-1     ,   and  the  ratio 

r    (r)    r    (r) 

cv  '      .    +  S  cV  ;   ,  +  S 

r,  r-1  —  r  r,   r-1  —  r 


c^  ,   S 

r,  r-1  +  r  . 


2        S 
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FIGURE     9 
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In  Figure  9  the  operations  at  a  given  level  in  the  tree  are  identical  and 
shown  to  the  right,  and  the  totals  are  shown  below.   Unlabeled  nodes  are 
assumed  to  use  the  result  computed  immediately  above.   In  this  and  subsequent 
discussions,  the  insignificant  routing  and  broadcasting  times  are  neglected. 
Assuming  that  one  division  is  equivalent  to  five  multiplications,  and  two 
additions  are  equivalent  to  one  multiplication,  the  evaluation  of  the  above 
quantities  on  a  serial  machine  would  effectively  require  35  multiplications. 
On  a  P  >  8  machine  the  same  process  requires  only  the  equivalent  of  20 
multiplications  or  10  |U.s.   This  is  clear  from  the  fact  that  the  tree  height 

is  equivalent  to  20  multiplications  and  its  width  does  not  exceed  8.   The 

35/P 

efficiency  of  the  parallel  computation  of  this  part  is  '  '  E  .   This  is  a 

20 
rather  conservative  definition  of  efficiency.   It  assumes  that  the  total  time 

"necessary"  for  a  parallel  machine  to  evaluate  a  function  is  just  the  equiva- 
lent number  of  multiplications  divided  by  the  number  of  PE's  available. 

The  row  vector  p  =  y  C  has  its  first  (r  -  l)  elements  equal  to 

zero.   Its  formation  requires  the  equivalent  of  3(n  -  r+l)  multiplications. 

The  column  vector  q  =  C  y  has  (n  -  r  -  k)    zero  last  elements  for  r  =  1, 

r    r  r 

2,  .  .  .  ,  n  -  5-   Its  construction  requires  the  equivalent  of  (3r  +  10 ) 
multiplications.   Assuming  that  the  size  of  the  matrix  is  n  =  (m)P  where  the 
integer  m  >  1,  the  average  number  of  equivalent  multiplications  required 
for  computing  either  p  or  q  on  a  parallel  computer  is  given  by 

1  5       3 

*-  X   3J  =  t:   (m  +  1) .   In  order  to  calculate  the  overall  efficiency, 
m  .  .,    °        2 

observe  that  for  the  matrix  C  there  are  m  diagonal  P  x  P  submatrices; 
for  each,  the  corresponding  portion  of  either  p  or  q  requires  three 
equivalent  multiplications  with  an  average  efficiency  of  50%,  and  for  each 
of  the  -  m(m  -  l)  P  x  P  submatrices  above  the  diagonal,  the  corresponding 

<C  J_j        J_i 

portion  of  either  p  or  q  also  requires  three  equivalent  multiplications, 
but  with  an  efficiency  of  100%.   Weighing  the  individual  efficiencies  with 
the  corresponding  number  of  equivalent  multiplications,  the  overall  average 
efficiency  for  computing  either  p  or  q  is  calculated  as  follows: 

Eff  f  =   (3m)  0.5  +  |  m(m  -  l)  (1.0)   =   m 

^r   rr  3   /     ~  m  +  1 

3m  +  -  m(m  -  1) 
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Since  v     and  q  contain  some  zero  elements  and  since  the  computation  of 
r      r 

2  t 

v  = v  requires  only  two  multiplications  and  that  of  a  =  p  v 

r    i  I   I  1 2  °r  r   ^r  r 

I  |yrl  I 

requires  four  multiplications,  both  v  and  a     can  be  computed  simultan- 
eously with  p  and  q  respectively.   The  computation  of  the  column  vector 
a  y  which  requires  two  multiplications  can  also  be  performed  simultan- 
eously with  the  computation  of  the  matrix  v  p  •   The  average  number  of 

multiplications  required  for  computing  v  p  on  a  parallel  machine  is  the 
same  as  that  of  p  and  q  and  with  the  same  efficiency.  The  construction 
of  the  vector  (q  -  a     y  )  requires  the  equivalent  of  ^  multiplications 

with  an  efficiency  of  |^-\.   The  average  number  of  multiplications  required 


»)• 


for  forming  the  matrix '(q  -  a     y  )  v  on  a  parallel  computer  is  also 

identical  to  that  of  p  or  q  and  has  the  same  efficiency.   The  same  holds 

r     r     t  t 

for  constructing  the  sum  C  -v  p  -  (q  -  (X     y)v.   Neglecting  the  time 

taken  to  find  the  origin  shifts  for  every  iteration,  it  can  be  concluded 

that  the  total  time  required  for  one  double  QR-iteration  on  a  parallel 

computer  of  P  PE's  and  for  a  matrix  of  size  n  =  (m)P  is  essentially 

Is  E 

((m)P     -  2)  [10-75  +   3-75   (m  +   l)]   jli.s.   with  an  efficiency  of 


0.5   (35/PJ  +  3-75  m  +   (2.25/PT 


'E 


10.75  +  3-75   (m  +  l) 

Thus,  the  time  required  for  one  QR-iteration  for  a  256  x  256  matrix  on  a 
16-PE  machine  (assuming  that  the  matrix  can  be  contained  in  the  primary 
memory)  is  19  ms  and  the  efficiency  of  parallel  computation  is  82$.   If 
PE  =  6k,   then  the  time  taken  is  7-5  ms  with  an  efficiency  of  52$.   Clearly, 
the  minimum  efficiency  of  the  QR-iteration  occurs  when  m  =  1.   For  example, 
for  n  =  PE  =  6k,    the  efficiency  is  22$.   In  order  to  establish  a  rough 
lower  bound  on  the  overall  efficiency  of  the  QR  algorithm,  consider  the 
case  for  which  m  =  1  and  when  the  matrix  gets  deflated  by  two  rows  and  two 
columns  every  two  iterations.   This  efficiency  is  given  by, 
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Eff.    =*  1    £     f .  (U8) 


where, 


j    =4,   6,   8,    ,  ..,    n-2      ,      N  =      fn-5 1  for  n   -  even 


r*i 


J    =  3,    5,   7.    •••,    n-2      ,     N  =         n-k     +1  for  n   -  odd 


M 


and, 


f,   -  _^ (38.5  +  3-75  j 

J        l8.25n 


Thus,    for  n   =  P„   =  6k,   the   QR  algorithm  efficiency   is   approximately  11.50$. 
Is 
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